REMARKS 

1. Information Disclosure Statement 

Applicants will shortly provide, under separate cover, a further copy of the references 
cited in the information disclosure statement filed December 7, 2001. Because 
the references were provided with the original submission, as evidenced by a stamped 
postcard (previously submitted), and again subsequently, and have apparently been lost in the 
PTO on both occasions, it is requested that the references be considered with effect from 
their original submission date of December 7, 2001. It is further noted that, in the 
Response dated July 22, 2002, applicants requested the Examiner to call if she was still 
unable to access the references, but in fact applicants did not learn of this until the present 
office action. 

2. Restriction Requirement 

The Examiner maintains the restriction requirement between groups I and 
II on the basis that Choo et al. demonstrates the special technical feature of a 
polynucleotide. Applicants maintain traverse. The special technical feature unifying the 
Group I and II claims is the selection of a quadruplet of bases within a target sequence, 
for use in the design of a binding protein. For the reasons discussed in the previous 
response, and below, such is not disclosed by Choo et al. Therefore, applicants request 
that the group II claims be rejoined to the group I claims. 

With respect to division between groups II and III, the Examiner maintains 
that the use of nucleic acids in a hybridization assay is not a throw-away utility but rather 
is often necessary for the development of new compounds and procedures. However, it is 
not clear to Applicants how the claimed polynucleotides, which encode non-naturally- 
occurring, engineered proteins, could be used in any type of hybridization assay. For 
example, it is not clear to what the nucleic acids would hybridize. Accordingly, the 
alleged utility asserted by the office action fails to fulfill the PTO's specificity prong of 
the "specific, substantial and credible" requirement under 35 USC 101. 

3. Obviousness-Type Double Patenting 

Claims 1-24 and 32 stand rejected for obviousness-type double patenting 
over claims 1-12 and 30 of copending USSN 09/424,488. The Examiner notes that when 
base 4 is A, the current application specifies Glu, Asn or Val, whereas the copending 
application specifies Gln, which differs from Glu only by the penultimate R group. The 
Examiner also notes that when base 4 is C, the current application specifies Ser, Thr, Val, 
Ala, Glu or Asn, whereas the copending application indicates any amino acid. 

Applicants disagree that the differences the Examiner has identified are so 
minor as to support obviousness-type double patenting. The test of obviousness-type 
double patenting is whether any claim of the cited application renders any of the present 
claims obvious. See MPEP 804 B.1. Although the Office Action states that Glu and Gln 
differ only by the terminal R group, this difference has significant effects on the chemical 
and biological properties of these two amino acids, since Glu is negatively charged at 
physiological pH, while Gln is uncharged. Moreover, it is not apparent from the 
claims of the pending application or elsewhere what effect the R groups of Glu and Gln 
have on their nucleotide binding specificity. Without such information, it would not have 
been obvious to substitute Gln with Glu. Similarly, Ser, Thr, Val, Ala, Glu and Asn 
represent a subset of the class of all amino acids specified in the copending case. The 
claims of the copending case provide no reason that one would select one of the 
particular subset of amino acids specified in the present claims rather than other amino 
acids. Absent such a suggestion, the present claims are not obvious from the claims of 
the copending case. 

Applicants also note that the amendments made to the claims in the 
response dated July 22, 2002 were unrelated to double patenting (either statutory or 
obviousness-type). Thus, neither the original nor amended claims provide a basis for 
statutory or obviousness-type double patenting. 

4. Rejections under 35 USC 112, second paragraph 

Claims 1(b) and 3(g) are said to be indefinite in not specifying the +6 
nucleotide if the +2 position is not Asp. However, these phrases are not recited in the 
pending claims. It appears that the Examiner may be referring to a copending case in 
which these phrases were at one time present. 

Claim 2 is said not to have antecedent basis in claim 1 in that claim 1 does 
not reference base 4 to be anything other than A or C. In response, the elements from 
claim 2 have been copied into claim 1 and claim 2 has been amended to refer to the two 
nucleotides occupying base 4 previously specified in claim 1. Thus, effectively the scope 
of claims 1 and 2 has been reversed such that claim 2 is a further limitation of claim 1 
rather than vice versa. 

Claim 3(g) is said to be unclear as to the meaning of a "small residue." This 
rejection has been rendered moot by deleting this phrase from the claim. It is noted that 
equivalent teaching is provided by the specification. It is further noted that the scientific 
and patent literature evidence frequent usage and understanding of the term "small amino 
acid" in the art. As evidence of this understanding, applicants attach a chart from 
Livingstone, Comput. Appl. Bio. Sci. 6, 745-56 (1993) defining what is meant by small 
amino acids. Further, a quick search of the USPTO issued patent database reveals over 
350 patents using this terminology. Based on this guidance and common usage of the 
terminology in the scientific literature, a skilled artisan would have no difficulty 
designing a zinc finger protein binding to a target sequence according to the amended 
claims. 

Claim 3(l) is said to be indefinite because of the term "asp." However, 
this term is not used in this claim. Rather the term "Asp" is used. This is consistent with 
the nomenclature for amino acids used in other claims and in the art. 

Claim 3(o) is said to contain a typographic error in that Fig. 6 teaches that 
position +2 is Asp when base 1 is C. However, Fig. 6 does not show nucleotide 
quadruplets at all, much less any in which position 1 is occupied by C. Thus, it is not 
understood how the Examiner finds this teaching in Fig. 6. However, a typographic error 
in 3(o) has been corrected. 

Claim 4 has been amended to refer to claim 3 rather than any preceding 
claim. It is noted that this amendment had been made previously in a preliminary 
amendment dated November 5, 2001. 

Claim 14 has been amended to delete the 

Claim 15 has been amended as suggested. 

The amendments to the above claims cure the alleged defects in claims 
depending therefrom. 

For these reasons, withdrawal of the rejection is respectfully requested. 

5. Rejections under 35 USC 102 

Claims 1-7, 9-11, 13, 15-23 and 32 stand rejected as anticipated by Choo 
et al. under 35 USC 102(b). 

As a preliminary matter, it is noted that the patentability of a design 
method should be judged from the steps of the method recited in the claims rather than 
from the nature of the zinc finger protein resulting from the method. An improved 
method of design is not obvious over a prior method of design simply because the 
improved method of design might result in some of the same proteins as the prior 
method. An improved method of design that has different method steps than a prior 
method may allow design of some of the same proteins as a prior method, but in addition 
will also allow design of some proteins that would not result from following the prior 
method. If the steps in the improved method are not suggested by those of the prior 
method, then the improved method is patentable notwithstanding that it may design some 
proteins that are the same as those from the prior method. 

In the remarks that follow, applicants first reiterate the position discussed 
in the last response that the presently claimed methods result in design of some zinc 
finger proteins that are not designed by following the cited reference. Although 
patentability is not determined by the nature of the zinc finger proteins that result from 
the method, this issue is significant in showing the error in the Examiner's apparent 
position that any difference between the present methods and those of Choo et al. is 
illusory. Applicants then explain the difference in method steps between the present 
claims and those of the cited reference that confer patentability of the presently claimed 
methods. 

The Examiner apparently views any difference between the present claims 
and Choo et al. to be an illusion due simply to different numbering schemes for bases. 
The Examiner presents three numbering systems in a Table on p. 5 of the office action. 
The first line of the table is a quadruplet numbering system labeled "Ex. Nos." which the 
Examiner says she is using as the basis of rejection. The second line of the table is a 
triplet numbering system of Choo et al. The third line of the table is a different 
quadruplet numbering system that is displaced from the system "Ex. Nos." by one base. 
In fact, the system "Ex. Nos." is the same as the numbering system of the present 
application, and that shown in the figures accompanying the last response. This is also 
true of the correspondence between "Ex. Nos." and the 5' Nos. in the second line of the 
Table. Thus, insofar as the Examiner bases her analysis on the Ex. Nos. and their 
correspondence with 5' Nos. of Choo et al., she is assuming the same correspondence as 
applicants. 

Elsewhere in the office action, the Examiner proposes equivalences 
different from those described between the first two lines of the table at p. 7, which 
are the only two lines relevant to this analysis. For example, base 2 of the present claims 
is equated with the 3' base of Choo et al. (see e.g., office action at p. 9, line 4). Base 1 of 
the present claims is also equated with the 3' base of Choo et al. (see e.g., p. 9, line 6). It 
is clear that the 3' base of Choo et al. cannot be both base 1 and base 2 of a quadruplet. 
In short, the Examiner has based her comparison of rules from Choo et al. on the 
erroneous assumption that base 1 of the present numbering corresponds to a 3' base in 
the Choo et al. system. Applicants note numerous other areas of disagreement with the 
Examiner's position that all the rules of the present claims are disclosed by Choo et al., 
but in view of the overriding error in assigning equivalence between base 1 of the present 
claims and the 3' base of Choo et al. believe it unnecessary to address these other issues 
at this time. 

Finally, when addressing applicants' previous remarks (office action at p. 
10), the office action states that base 4 of a quadruplet (i.e., present numbering) is 
equivalent to the 5' base of a triplet as in Choo et al. The office action does not expressly 
say what alignment is proposed between the other bases in a quadruplet and those of 
Choo et al. In any event, if base 4 of a quadruplet is viewed as being aligned with the 5' 
base of a Choo et al. triplet, it follows that bases 3 and 2 of the quadruplet are aligned 
with the mid and 3' bases of the Choo et al. triplet, and base 1 (present numbering) is 
aligned with the 5' base of a different Choo et al. triplet. This alignment is shown in Fig. 
3A (attached to this Response, which is the same as the upper portion of Fig. 1 previously 
submitted with the Response mailed July 22, 2002). As the Examiner can see, the 
alignment of bases in Fig. 3A is the same as that shown between Ex. Nos. and 5' Nos. in 
the Table at p. 7 of the office action. 

Comparison of Fig. 3A (in which present base 4 is treated as being the 5' 
base of a Choo et al. triplet) with Fig. 3B (in which base 4 is treated as being part of a 
quadruplet) shows the consequential effects in design of a zinc finger protein.¹ In Fig. 
3A (illustrating the Choo et al. design rules), the "G" residue is not taken into account in 
the design of a zinc finger. By contrast, in Fig. 3B (illustrating the presently-claimed 
design rules), the "G" occupies position 1 of a quadruplet, thereby specifying that the 
amino acid at position +2 of zinc finger F1 is a Glu. The resulting designs differ in that 
the +2 position of finger F1 is occupied by a Ser using the triplet design rules and a Glu 
using the quadruplet design rules. Therefore, besides being a different method, the use of 
a quadruplet code compared to a triplet code can have a material effect on the resulting 
design of zinc finger proteins. 

The preceding analysis shows that the presently claimed methods can 
result in different designs of zinc finger proteins than the cited Choo et al. reference. 
Such is significant in showing that any difference between the presently claimed methods 
and Choo et al. is not merely an illusion resulting from a different numbering scheme. 

¹ Figs. 3A and 3B are not part of the application but are attached to this response to illustrate differences 
between the claimed invention and cited reference. 

Applicants now return to the key issue in determining patentability, which 
is that the steps of the claimed methods are not disclosed or suggested by the cited Choo 
et al. reference. The presently claimed methods include steps of selecting a quadruplet of 
bases within a target sequence and then applying design rules to each of the four bases in 
the quadruplet. By contrast, the Choo et al. reference, at best, selects a triplet of bases 
and applies design rules to certain triplet sequences. It is not sufficient, for establishing 
obviousness, to say that following the Choo et al. reference may result in some of the 
same zinc finger proteins as following the presently claimed methods, because this 
position bases patentability on the zinc finger proteins that result from the methods rather 
than the method steps themselves. Absent evidence that Choo et al. discloses or suggests 
the steps recited in the present claims, the rejection should be withdrawn. 

6. Rejection under 35 USC 103 

Claims 1-7, 9-11, 13-23 and 32 stand rejected as obvious over Choo et al. 
in further view of Krizek. Krizek is cited as teaching the peptide sequence of SEQ 
ID NO:6 of the present application. Choo et al. is applied as above. In response, Krizek 
does nothing to remedy the lack of disclosure in Choo et al. regarding selecting 
quadruplets and the consequences of doing so, as discussed above. Therefore, claims 
1-7, 9-11, 13-23 and 32 are not obvious for at least the reasons discussed above. 

If the Examiner believes a telephone conference would aid in the 
prosecution of this case in any way, please call the undersigned at 650-326-2400. 



Respectfully submitted, 

Joe Liebeschuetz 
Reg. No. 37,505 

TOWNSEND and TOWNSEND and CREW LLP 

Two Embarcadero Center, 8th Floor 

San Francisco, California 94111-3834 

Tel: 650-326-2400 

Fax: 415-576-0300 

Protein sequence alignments: a strategy for 
the hierarchical analysis of residue 
conservation 

Craig D. Livingstone and Geoffrey J. Barton¹ 



Abstract 

An algorithm is described for the systematic characterization 
of the physico-chemical properties seen at each position in a 
multiple protein sequence alignment. The new algorithm allows 
questions important in the design of mutagenesis experiments 
to be quickly answered since positions in the alignment that show 
unusual or interesting residue substitution patterns may be 
rapidly identified. The strategy is based on a flexible set-based 
description of amino acid properties, which is used to define 
the conservation between any group of amino acids. Sequences 
in the alignment are gathered into subgroups on the basis of 
sequence similarity, functional, evolutionary or other criteria. 
All pairs of subgroups are then compared to highlight positions 
that confer the unique features of each subgroup. The algorithm 
is encoded in the computer program AMAS (Analysis of Multiply 
Aligned Sequences) which provides a textual summary of the 
analysis and an annotated (boxed, shaded and/or coloured) 
multiple sequence alignment. The algorithm is illustrated by 
application to an alignment of 67 SH2 domains where patterns 
of conserved hydrophobic residues that constitute the protein 
core are highlighted. The analysis of charge conservation across 
annexin domains identifies the locations at which conserved 
charges change sign. The algorithm simplifies the analysis of 
multiple sequence data by condensing the mass of information 
present, and thus allows the rapid identification of substitutions 
of structural and functional importance. 

Introduction 

A protein that exhibits key biological functions will commonly 
have homologues sequenced from many different tissues and 
organisms. Accurate multiple sequence alignment of such a 
protein family can highlight the residues of common functional 
and structural importance. The location of identities and con- 
servative substitutions may be used to guide the design of site- 
directed mutagenesis experiments whilst the identification of 
subtle patterns of residue conservation can yield improvements 
in the accuracy of secondary and tertiary structure predictions 
(Crawford et al., 1987; Zvelebil et al., 1987; Benner and 
Gerloff, 1990; Barton et al., 1991; Russell et al., 1992). Such 
analyses of multiple sequence alignments have traditionally been 
performed by eye. However, for large alignments, only the most 
obvious patterns of residue conservation can be easily identified 
by this method. When many long sequences are to be scrutinized, 
the task becomes unmanageable, and the risk of missing 
interesting residue substitutions is great. 

¹ To whom correspondence should be addressed 

A number of computer programs have been developed to aid 
the interpretation of multiple sequence alignments. The pro- 
grams PRETTY and PRETTYPLOT from the GCG package 
(Devereux et al., 1984) derive consensus amino acid sequences 
and box the largest group of similar residues at each position 
of an alignment. ALSCRIPT (Barton, 1993) allows shading, 
boxing and colouring to be applied to an alignment. Colour is 
also exploited by the SOMAP program (Parry-Smith and 
Attwood, 1991), which colours residues according to which 
user-defined set they belong (e.g. hydrophobic, charged). The 
amino acid variation at a position in an alignment is reduced 
to a single figure of 'variability' by Kabat (1976), 'entropy' 
or 'variation' by Sander and Schneider (1991), 'information' 
by Smith and Smith (1990) and 'evolutionary divergence' by 
Brouillet et al. (1992). In contrast, the novel set-based approach 
described by Taylor (1986), defines the minimal set of physico- 
chemical properties that represent any group of amino acids. 
This principle has been developed by Zvelebil et al. (1987) so 
that the minimal set of amino acids could be encoded as a single 
'conservation number' at each position in the alignment. 
Although very effective at highlighting the overall similarity 
at each position in an alignment, none of these methods deal 
with the problem of quantifying similarities between subfamilies 
within a larger multiple sequence alignment. 

It is frequently desirable to subdivide a protein family on the 
basis of function, origin, sequence similarity or other criteria. 
Indeed, most multiple alignment methods (e.g. Barton, 1990; 
Barton and Sternberg, 1987; Feng and Doolittle, 1987; Higgins 
and Sharp, 1989) first compare all sequences pairwise, then 
automatically cluster the sequences into subfamilies on the basis 
of sequence similarity. Such cluster analysis can readily identify 
the gross similarities between sequences but does not pinpoint 
the residue positions that are responsible for the clustering 
pattern. It may also be difficult to rationalize the clusters 
identified by overall sequence similarity with those implied by 
functional similarity since functional differences may reside in 
a few key residues. Although all previous methods for 
characterizing residue conservation (e.g. Kabat, 1976; Devereux 
et al., 1984; Taylor, 1986; Smith and Smith, 1990; Parry-Smith 
and Attwood, 1991; Sander and Schneider, 1991; Brouillet 
et al., 1992) provide a clear overview of conservation across 
an alignment, they do not allow the automatic identification of 
residue positions specific to subgroups of sequences within the 
alignment. 

In this paper we describe an algorithm for the systematic 
identification of residue conservation within aligned protein 
sequences. The algorithm operates in a hierarchical manner, 
by first characterizing conservation on a residue-by-residue basis 
within predefined subfamilies, then between all pairs of sub- 
families. This hierarchical approach highlights positions that 
may be responsible for conferring the specific structural and 
functional properties of the subfamilies. 

Systems and methods 

The hierarchical conservation analysis algorithm is implemented 
in the computer program AMAS (Analysis of Multiply Aligned 
Sequences) written in ANSI-C. AMAS can generate commands 
for the ALSCRIPT program (Barton, 1993), which will 
automatically shade, box and colour a multiple alignment 
according to the identified conservation patterns. AMAS and 
ALSCRIPT have been used successfully on a number of Unix 
platforms. If the graphical display options are required, then 
a Postscript printer or interpreter is required. 

Algorithm 

Quantification of amino acid residue conservation 

We have extended the work of Zvelebil et al. (1987) to give 
a general method for quantifying residue conservation. Our ap- 
proach differs in detail to that described by Zvelebil et al., so 
for the sake of completeness and to avoid possible confusion 
we here describe the protocol used to quantify and compare 
residue conservation. 

Figure 1(a) illustrates a Venn diagram (for details see Taylor, 
1986) which is contained within a boundary that symbolizes 
the universal set of 20 common amino acids (e). The amino 
acids that possess the dominant properties, namely hydrophobic, polar 
and small (<60 Å³), are defined by their set boundaries. 
Subsets contain amino acids with the properties aliphatic 
(branched sidechain non-polar), aromatic, charged, positive, 
negative and tiny (<35 Å³). Shaded areas define sets of 
properties possessed by none of the common amino acids. The 
Venn diagram may be simply encoded as the property table 
or index shown in Figure 1(b), where the rows define properties 
and the columns refer to each amino acid. 

Cysteine occurs at two different positions in the Venn 
diagram. When participating in a disulphide bridge (C_S-S), 
cysteine exhibits the properties 'hydrophobic' and 'small'. In 
addition to these properties, the reduced form (C_S-H) shows 
polar character and fits the criteria for membership of the 'tiny' 
set. 

When analysing proteins that do not have disulphides, an 
index which represents the properties of reduced cysteine is used 
(see SH2 domain analysis). In proteins where disulphide 
bonding is known to occur, or where the oxidation state of the 
cysteines is uncertain, an index representing cysteine in the 
oxidized form is generally more useful (as in Figure 1b). 

The illustrated Venn diagram (Figure 1a) assigns multiple 
properties to each amino acid; thus lysine has the property 
hydrophobic by virtue of its long sidechain as well as the 
properties polar, positive and charged. Alternative property tables may 
also be defined. For example, the amino acids might simply 
be grouped into non-intersecting sets labelled hydrophobic, 
charged and neutral. 

Figure 2 illustrates the stages involved in the calculation of 
conservation numbers for a simplified property index (Figure 
2a and b). All of the amino acids are assigned to the universal 
set (e), which in this simple example contains only the charged 
subset, which in turn is broken down into subsets containing 
positively and negatively charged amino acids. This property 
index allows the positions of conserved charges to be identified, 
together with positions where a conserved charge changes 
polarity between different groups of sequences within an 
alignment. 
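As an illustration of the simplified index described above, the following minimal Python sketch encodes a charge-based property table with properties as rows and amino acids as columns. It is not taken from AMAS (which is written in ANSI-C). The set memberships, in particular counting His as positive, and the names `CHARGE_INDEX`, `properties_of` and `render_index` are assumptions made for this example.

```python
# Sketch of a charge-based property index: rows are properties,
# columns are amino acids (one-letter codes).
# ASSUMPTION: membership below (His treated as positive) follows the usual
# Taylor-style classification; it is not copied from Figure 2.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

CHARGE_INDEX = {
    "charged": set("DEHKR"),
    "positive": set("HKR"),
    "negative": set("DE"),
}


def properties_of(residue, index=CHARGE_INDEX):
    """Return the names of all property sets that contain the residue."""
    return {name for name, members in index.items() if residue in members}


def render_index(index=CHARGE_INDEX, residues=AMINO_ACIDS):
    """Render the index as text: '●' marks membership, '○' non-membership."""
    width = max(len(name) for name in index)
    lines = [" " * width + " " + " ".join(residues)]
    for name, members in index.items():
        marks = " ".join("●" if res in members else "○" for res in residues)
        lines.append(name.ljust(width) + " " + marks)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_index())
```

Storing each property as a set keeps membership tests simple and makes it easy to swap in an alternative table, such as a 10-property index.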

The amino acids occurring at each position in the multiple 
alignment are recorded (Figure 2d), then tested for the presence 
of each of the three properties (Figure 2b). This is represented 
by the columns of entries for each amino acid (Figure 2e). For 
example, at aligned position 11, the first column in Figure 2(e) 
represents the properties of arginine, the second column the 
properties of tryptophan and so on. Filled circles show the 
amino acid is a member of a property set, empty circles indicate 
non-membership. 

Each property is considered in turn by examining the rows 
of entries in Figure 2(e). If all of the amino acids at a position 
possess the property, then the position shows positive conserva- 
tion; all entries on that property's row in Figure 2(e) will be 
filled circles and a filled circle appears in Figure 2(f). If all 
amino acids at a position lack the property, then the position 
shows negative conservation; all entries on the row in Figure 
2(e) will be empty circles and an empty circle is seen in Figure 
2(f). If the possession of a property varies in the set of amino 
acids being considered, filled and empty circles appear in the 
equivalent row in Figure 2(e), the property is labelled as 
unconserved and a shaded circle is shown in Figure 2(f). 
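The three outcomes described above (positive conservation, negative conservation, and a property that varies) can be sketched in Python as follows. This is an illustrative sketch only: the function name `property_states` and the charge index memberships are assumptions for the example, not part of the published program.

```python
# ASSUMPTION: a three-property charge index with His counted as positive.
CHARGE_INDEX = {
    "charged": set("DEHKR"),
    "positive": set("HKR"),
    "negative": set("DE"),
}


def property_states(column, index=CHARGE_INDEX):
    """Classify each property for the residues found at one alignment position.

    Returns a dict mapping property name to one of:
      'positive'    - every residue has the property (filled circle)
      'negative'    - no residue has the property (empty circle)
      'unconserved' - some residues have it, some do not (shaded circle)
    """
    if not column:
        raise ValueError("column must contain at least one residue")
    states = {}
    for name, members in index.items():
        held = [res in members for res in column]
        if all(held):
            states[name] = "positive"
        elif not any(held):
            states[name] = "negative"
        else:
            states[name] = "unconserved"
    return states


if __name__ == "__main__":
    # Arg and Trp together: only the 'negative' property is conserved,
    # and it is conserved negatively (neither residue is negative).
    print(property_states("RW"))
```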

Two methods are used to quantify conservation at an alignment 
position using the information stored in Figure 2(f). 
Method 1 is similar to that of Zvelebil et al. (1987) and regards 
as conserved any property that is either positively or negatively 
conserved. The number of properties obeying this rule (number 
of filled or empty circles for a position in Figure 2f) is summed 
to give the conservation number (Figure 2g). In contrast, 
method 2 only counts properties that are positively conserved 
(filled circles in Figure 2f) and gives the conservation numbers 
shown in Figure 2(h). 

Fig. 1. Physico-chemical properties of the amino acids. (a) The 20 common amino acids are shown in terms of 10 physico-chemical properties (Taylor, 
1986; Zvelebil et al., 1987). Grey-filled areas define sets of properties possessed by none of the common amino acids. The hydrophobic, polar and small 
sets dominate the figure. The remaining sets define subsidiary groups. The dotted line joining L to R shows the minimum number of five set boundaries 
which must be crossed in order to change an L to an R in this 10 property diagram (see text). (b) An amino acid property index derived from the Venn 
diagram in (a) (after Zvelebil et al. (1987), treating Cys as C_S-S). The columns represent the amino acids while rows represent properties. Filled circles 
show when an amino acid possesses a property. A represents gap which, in this index, is regarded as having all properties. 

Fig. 2. Calculation of conservation numbers. The Venn diagram showing the relationship between the amino acids on the basis of charge (a) is converted 
to a property index (b), which is used to analyse the conservation of charged residues in the sequence alignment (c). The amino acids present at each sequence 
position are recorded (d) and tested for each of the properties in the index (e). Columns of filled (presence of a property) and empty (lack of a property) 
circles record the properties of each amino acid in the same vertical order as in the property index. The presence of properties is summed (f); filled circles 
show positive conservation of a property in the group of amino acids, shaded circles show where properties are present in some but not all of the amino 
acids, and empty circles show negatively conserved properties. A conservation score is arrived at by summing either the number of positively and negatively 
conserved properties (g, method 1) or the number of positively conserved properties alone (h, method 2) (see text). 

The method 1 conservation value is a function of the number 
of set boundaries P that must be crossed to visit all the amino 
acids at a position. If a property index contains N properties 
then the conservation number (C_n) is N - P. For example, the 
dotted line in Figure 1(a) joins Leu and Arg and crosses five 
set boundaries, thus for this property matrix, C_n(L,R) = 
10 - 5 = 5. The maximum possible value for the conservation 
number calculated by method 1 is given by the number 
of properties in the property index (3 for Figure 2b; 10 for 
Figure 1b). 

Conservation by method 2 is calculated by counting the 
number of sets common to all amino acids at a position. Leu 
and Arg in Figure 1(a) share no properties; by method 2, their 
conservation number is 0. Asp and Glu in Figure 2(a) are both 
members of the sets charged and negative; their conservation 
number by method 2 is 2. The maximum value for the con- 
servation value calculated by method 2 is the maximum number 
of properties possessed by a single amino acid in the property 
index. 
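The two scoring methods can be sketched in Python as below. This is a hedged illustration rather than the AMAS implementation: the function names and the three-property charge index (with His assumed positive) are choices made for the example. With this small index N = 3, so the Leu/Arg pair scores 3 - 2 = 1 by method 1 rather than the 10 - 5 = 5 obtained with the 10-property index of Figure 1.

```python
# ASSUMPTION: a three-property charge index with His counted as positive.
CHARGE_INDEX = {
    "charged": set("DEHKR"),
    "positive": set("HKR"),
    "negative": set("DE"),
}


def conservation_method1(column, index=CHARGE_INDEX):
    """Method 1: count properties that are positively OR negatively conserved.

    Equivalent to N - P, where N is the number of properties and P is the
    number of properties whose set boundary separates residues in the column.
    """
    crossed = 0
    for members in index.values():
        held = [res in members for res in column]
        if any(held) and not all(held):
            crossed += 1  # this set boundary must be crossed
    return len(index) - crossed


def conservation_method2(column, index=CHARGE_INDEX):
    """Method 2: count only the property sets shared by every residue."""
    return sum(1 for members in index.values()
               if all(res in members for res in column))


if __name__ == "__main__":
    for column in ("DE", "LR"):
        print(column,
              conservation_method1(column),
              conservation_method2(column))
```

Method 1 rewards the shared absence of a property as well as its shared presence, which is why a pair such as Leu/Arg can score above zero by method 1 while sharing no property at all.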

Treatment of gaps and unusual residues 

Insertions and deletions (gaps— A) are usually tolerated only 
in surface loop regions. Accordingly, gaps are normally given 
all properties in the property matrix so that aligned positions 
that contain a gap are assigned a low conservation value. 

The set-based conservation analysis described here is 
independent of the number of sequences analysed. For example, 
a position in an alignment of 100 sequences that contains 99 
alanines and one lysine will give the same conservation value 
as a position in an alignment of two sequences that has one 
alanine and one lysine. The advantage of this approach is that 
the tolerance of particular physico-chemical properties at a posi- 
tion indicates the likely environment of the amino acids in the 
common fold of the protein family. This reasoning suggests 
that a position that conserves valine in 99 sequences, but also 
shows aspartate is unlikely to be performing a common struc- 
tural or functional role. However, it may sometimes be 
suspected that one or more of the sequences contain errors, or 
that there are errors in the alignment. It is then desirable to 
relax the strict conservation rules. Accordingly, a predetermined 
number of gaps or residues that represent <N% of the total 
at a position may be ignored when calculating conservation 
values. For example, alignment position 3 in Figure 2 is 
predominantly Asp. This position would not be recorded as con- 
served using the charge index due to the presence of a single 
Asn (1/12 or 8.3% of the sequences in the alignment). If a 10% 
threshold for unusual residues is set, then this Asn would be 
ignored when calculating the conservation value (similarly, Val 
at position 10). Positions where unusual residues have been ig- 
nored are reported only as conserved, never as identical even 
if the other residues present are identical (Figure 2, position 
3). It is the ability to quantify the conservation of amino acids 
that gives the set-based approach its major advantage over 
averaging a single property scale. Caution must therefore be
exercised when deciding to ignore gaps and unusual residues. 
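
The screening of unusual residues described above can be sketched as follows. This is a minimal illustration, not the AMAS implementation: the function names and the `min_fraction` parameter are hypothetical, and only the percentage threshold is modelled (the separate allowance for a fixed number of gaps is left out).

```python
from collections import Counter


def screen_column(column, min_fraction=0.10):
    """Split the residue types in one alignment column into kept and screened sets.

    A residue type (or gap character) is screened when its share of the
    column is below min_fraction. Names and default are illustrative only.
    """
    counts = Counter(column)
    total = len(column)
    kept = {res for res, n in counts.items() if n / total >= min_fraction}
    screened = set(counts) - kept
    return kept, screened


def may_report_identity(screened):
    """A position is never reported as identical once anything was screened."""
    return not screened


# Position 3 of the 12-sequence example: eleven Asp and a single Asn (1/12 = 8.3%).
kept, screened = screen_column("DDDDDDDDDDDN", min_fraction=0.10)
```

With a 10% threshold the single Asn is screened, so the position can be reported as conserved but not as identical; lowering the threshold to 5% keeps the Asn in the calculation.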

Hierarchical conservation analysis 

The procedures described in the previous section are a 
straightforward extension of the principles described by Zvelebil 
et al. (1987) and Taylor (1986). Here we extend the set-based 



method to identify conserved features of sequence subgroups 
within larger protein sequence alignments. 

The starting point for hierarchical conservation analysis is 
the identification of two or more subsets of sequences within 
a multiple sequence alignment. The subsets may be defined by 
grouping on the basis of overall sequence similarity, by func- 
tional similarity, origin or other criteria. Given such groupings, 
the aim is to highlight which residue positions define the uni- 
que properties of each group. 

Figures 3 and 4 illustrate the result of applying hierarchical 
conservation analysis to a nine residue fragment of a 26 
sequence multiple alignment using the 10 property index shown 
in Figure 1. The dendrogram shown at the left of Figure 3 shows
the overall similarity between the sequences (i.e. not just the 
nine residues) and clearly splits the sequences into three sub- 
groups labelled A, B and C. 

Conservation numbers are calculated for each alignment posi- 
tion in each subgroup and a conservation threshold is set. This 
reference point is used to put each position within a sub-group 
into one of three classes: (i) identical positions; (ii) conserved 
positions, where the conservation number is greater than or 
equal to the threshold; and (iii) unconserved positions, where the
conservation number is less than the threshold. The choice of
threshold depends upon the particular conservation index being
used. For the index shown in Figure 1, a threshold of between
6 and 8 normally gives the most informative results. In
Figure 3, the different classifications using a threshold of 8 are 
illustrated by shading and font changes. For example, in 
subgroup A, identities are shown in white on dark grey at positions
2 and 4, conserved positions are in black on light grey
(positions 6-9), and unconserved positions are illustrated in
italics on a white background (positions 3 and 5). At position 
1, the identity in all sequences is marked by white on black 
lettering, whilst at position 10 chancery script lettering is used 
to highlight the lack of conservation within all sub-groups. 
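
The three-way classification can be written down directly. The sketch below assumes the conservation number for the subgroup position has already been calculated; the function name and argument names are hypothetical.

```python
def classify_position(residues, conservation_number, threshold):
    """Place one subgroup position into one of the three classes.

    residues            -- the residues observed at the position in the subgroup
    conservation_number -- conservation number already computed for them
    threshold           -- conservation threshold chosen for the index in use
    """
    if len(set(residues)) == 1:
        return "identical"
    if conservation_number >= threshold:
        return "conserved"
    return "unconserved"
```

For example, with a threshold of 8, a position holding only Gly is identical, a mixed position scoring 8 or more is conserved, and a mixed position scoring 7 or less is unconserved.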

Having classified the conservation within each subgroup, all 
pairs of subfamilies are compared and conservation numbers 
calculated for each position in the pairs. In the calculation of 
conservation for a pair of subfamilies, the residues from the 
pair are considered as members of a single group. C_n is then
calculated, as described above, for the composite group according
to which method was chosen. The change in conservation
value that occurs when each pair of subfamilies is brought 
together reflects the similarities or differences in physico- 
chemical properties seen in each subgroup at that position. For 
example, at position 7 of subfamilies A and B the conservation 
values in A, B and A + B are all 9, showing that the properties
are conserved within each family, and across both families at 
this position. This is, therefore, a location that exhibits com- 
mon physico-chemical properties between A and B, yet these 
properties are not conserved within group C. Accordingly, this 
may indicate a tertiary structural feature shared between A and 
B, but not C. 

[Key to Figure 3 shading: identical across all sequences; identical within one sub-group; conserved within one sub-group; unconserved within one sub-group; unconserved across all sequences.]

Fig. 3. Hierarchical conservation analysis. A 10 residue fragment of a multiple sequence alignment of 26 sequences is shown to the right of the figure. The
relationship between the sequences in the whole alignment is represented by the dendrogram to the left, which shows three sub-groups: A, B and C. Each
position of the groups in the multiple sequence alignment has been analysed for residue conservation using the property index in Figure 1(b). The conservation
threshold was set to 8. Information about the conservation pattern is given at the foot of the alignment in numerical and graphical form. The representation
of the alignment and the conservation patterns to the right of the figure were imported directly from the graphical output of the program AMAS.
Fig. 4. Text representation of sequence conservation, with reference to Figure 3. The text representation of the analysis gives a more detailed description
of the conservation of physico-chemical properties at each alignment position. Each record identifies the sequence position to which it refers (rounded brackets), 
the sub group(s) involved in the pattern being reported, the pair conservation number(s) of those groups where non-identities are reported (rounded brackets), 
the residues present in each group (square brackets) and the properties which are conserved by them and which differ between them. Differences in properties 
between subgroups are reported; the percentage of residues in each subgroup that have a property is shown in square brackets. 



In contrast, at position 8 of subgroups A and C, in order to
'visit' all members of the combined set of amino acids from
A + C (DEQR) a minimum of four set borders must be crossed,
giving a value of C_n of 10 - 4 = 6. The conservation values
for A, C and A + C are, therefore, 9, 8 and 6 respectively.
Thus, although properties are conserved within each subgroup
at this position, the properties that are conserved differ between
the subgroups. This type of conservation pattern might highlight 
a position in the protein structure that defines the specificity 
for a substrate. For example, the switch from a predominantly 
negative to positive charge between groups A and C may signal 
increased binding for a negatively charged moiety for the group 
C sequences when compared to group A. 
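
The pair comparison can be illustrated with a small sketch. Two things here are assumptions: the property table is a toy index invented for the example (it is not the index of Figure 1, so the numbers do not reproduce those quoted above), and the conservation number is taken to be the number of properties whose state is the same for every residue in the group, which is one reading of counting the set borders crossed.

```python
# Toy property index: NOT the index of Figure 1, just enough to run the example.
TOY_INDEX = {
    "D": {"polar", "charged", "negative"},
    "E": {"polar", "charged", "negative"},
    "K": {"polar", "charged", "positive"},
    "R": {"polar", "charged", "positive"},
    "Q": {"polar"},
    "V": {"hydrophobic"},
}
PROPERTIES = ("hydrophobic", "polar", "charged", "positive", "negative")


def property_state(residues, prop):
    """True if every residue has prop, False if none has it, None if mixed."""
    flags = {prop in TOY_INDEX[res] for res in residues}
    return flags.pop() if len(flags) == 1 else None


def conservation_number(residues):
    """Number of properties whose state is the same for every residue."""
    return sum(property_state(residues, p) is not None for p in PROPERTIES)


def differing_properties(group_a, group_b):
    """Properties conserved within each group but with opposite states."""
    out = []
    for p in PROPERTIES:
        a, b = property_state(group_a, p), property_state(group_b, p)
        if a is not None and b is not None and a != b:
            out.append(p)
    return out


# Two hypothetical subgroups: one acidic (D, E), one basic (K, R).
within_a = conservation_number("DE")
within_b = conservation_number("KR")
combined = conservation_number("DE" + "KR")
```

In this toy case each subgroup is fully conserved on its own, but the composite group loses the two charge-sign properties, and `differing_properties("DE", "KR")` names them. This mirrors the pattern described for position 8: conservation within each subgroup, with a drop in the pair value that points to the properties distinguishing them.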

General rules for linking such substitution patterns to changes 
in three-dimensional structure or function are as yet unknown. 
However, changes in conservation of charge, hydrophobicity
or amino acid size are likely to be of importance in all protein
families. 

The result of the pairwise comparison of subfamilies is 
summarized below the alignment in Figure 3. The conservation
values for the pairs of subgroups are displayed as either
similarities or differences according to the rules shown in Table
I. The similarity and difference sections are also summarized 
as histograms. 

Fig. 5. Charge conservation in 40 annexin repeats. (a) The pattern of conserved charge in 40 annexin repeats determined using the charge property index
described in Figure 2. Only positive property conservation is considered at a conservation threshold of 2; this means that a subgroup position must conserve
both charge and polarity to be reported. Conserved positions alone are reported in order to highlight the pattern of charged residues; the residues at unconserved
positions have been masked out. Two gaps, and residues constituting <10% of a subgroup position, have been screened from the conservation calculation.
Identities and conserved positions are identified according to the shading protocol given in Figure 3. A charge difference is clearly seen in the histogram
at position 31, reflecting the switch between a conserved E (negative) in repeat 2 and a conserved R (positive) in repeat 4. (b) Text output accompanying
the analysis in Figure 5(a). The record format used is identical to that used in Figure 4.

The hierarchical clustering approach addresses the problem
of how to weight the information content of each sequence in
an alignment. At the simplest level, each sequence would be
treated equally but this relies on the sequences being equally
diverse throughout the alignment. The use of clustering to derive
conservation patterns ensures equal weight is given to different
groups of proteins irrespective of the number of examples of
each type. Inevitably, this process involves the loss of information
about the minor sequence variation which is responsible
for subtle differences in character between similar proteins in a subgroup.
This loss is balanced by the ability to detect the more substan- 
tial changes in conservation which determine the differences 
in properties between the separate subgroups. 

Implementation 

Text representation 

AMAS accepts command line arguments and provides a detailed 
textual breakdown of the conservation within a multiple align- 
ment. Figure 4 illustrates the AMAS textual analysis that cor- 
responds to the alignment shown in Figure 3. Only those 
positions that display conservation of the properties in the chosen 
property index are described. The presentation of the text results 
is hierarchical. Identities are described first (1), followed by 
positions showing conservation of physico-chemical properties 
(2), and unconserved positions listed last (3). Each entry contains
a record of the alignment position (rounded brackets to
the left), of the subgroup(s) to which it refers and a list of the
residues in each subgroup cited (square brackets). In addition,
for positions that do not show identities, the properties conserved
at the position, and those that differ, are reported. With
reference to Figure 4: 

• Identities. Section 1 lists those sequence positions that are 
identical across the whole alignment, between pairs of 
subgroups and within one subgroup. Information is not 
repeated lower down the hierarchy if it has already been 
presented, e.g. the Gly at position 1 in the alignment is not 
also reported as two pairs of identical subgroups or as three 
identical individual subgroups. 

• Conservation of properties. Conservation of physico- 
chemical properties between subgroups (following the same 
redundancy rules as for identities) is reported in section 2. 
The four categories of conserved positions are: (1) all 
subgroups conserve similar properties; (2) pairs of conserv- 
ed subgroups share properties; (3) pairs of conserved 
subgroups have dissimilar properties; and (4) individual 
subgroups are conserved. The properties that are positively 
conserved between pairs of subgroups are listed, as are those 
properties that cause differences between subgroups. For 
each of a pair of different subgroups, the percentage of 
residues that display the differing properties is shown in 
square brackets. 

• Unconserved. There are two divisions, the first for single
unconserved subgroups and the second for entirely unconserved
alignment positions.
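
The percentages shown in square brackets for differing properties can be computed as in the sketch below. It assumes the percentage is taken over the sequences in the subgroup (one residue per sequence) rather than over distinct residue types; the function name and the example membership set are hypothetical.

```python
def property_percentage(column, members):
    """Percentage of residues in a subgroup column that belong to `members`.

    column  -- residues at one alignment position, one per sequence
    members -- residue codes that possess the property of interest
    """
    if not column:
        raise ValueError("empty subgroup column")
    hits = sum(1 for res in column if res in members)
    return 100.0 * hits / len(column)


# Illustrative membership set for a positive-charge property.
positive = {"K", "R"}
share = property_percentage("KKRE", positive)
```

Here three of the four residues carry the property, so the value reported for the subgroup would be 75%.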

Graphical display 

The optional graphical representation of results mimics a hand 
analysis of the alignment using coloured marker pens. In Figure 
3 the alignment is shown divided into three subfamilies. Within
the subfamilies, at each alignment position, the amino acids are
appropriately highlighted. Conserved subgroups, subgroups 
showing identity and positions that show identity across the 
whole alignment are labelled. Figures 5 and 6 illustrate the 
graphical representation applied to the annexin and SH2 
domains. 

Three highlighting methods have been explored. Monochrome
methods allow grey shading (Figures 5 and 6) or the
use of different fonts (not shown) to highlight the differences 
in conservation. Grey shading is preferable for publication, 
whilst unshaded alignments are useful as working copies for 
hand annotation. Colour may be specified as an alternative to 
shading to provide additional visual impact. 

Discussion 

The strategy described in this paper is extremely flexible: it 
allows different physico-chemical properties to be examined in- 
dependently, or in concert. In addition, an alignment may be 
dissected into any combination of subgroups and their relative 
conservation analysed. As with any analytical procedure, the 
strategy is most effective when one has a clear idea of what 
one is looking for. For example: 'What makes subgroup A dif- 
ferent from B and C?', or 'Which residues in subgroup D should 
I change to make D more like A?' If no clear questions have 
been defined, then the general property index (Figure 1b) is
a useful starting point to highlight patterns of residue con- 
servation. This is illustrated in Figure 6 for an alignment of 
67 SH2 domains (Russell et al., 1992). Since SH2 domains are 
cytoplasmic, Cys was assigned the properties of the free amino 
acid (Q_ w ) in this analysis (Figure 1b). The alignment is
divided into eight subgroups on the basis of overall sequence
similarity. Subgroups 1-7 (numbering from the top) share
>20% sequence identity, whilst sequences not fitting into one 
of these subgroups are collected in subgroup 8. The overall con- 
servation of physico-chemical properties is highlighted by the 
histogram at the base of the alignment. The upper histogram 
indicates the normalized frequency of similarities between pairs 
of subgroups, whilst the lower plot shows the frequency of pair 
differences. Dark shading of the histogram indicates the fre- 
quency of pairs of subgroups that show sequence identity. A 
hand analysis of an alignment similar to that shown in Figure 
6 correctly identified the location of the core secondary struc- 
tures, and phosphotyrosine-binding residues (Russell et al., 
1992; Barton and Russell, 1993). Since completion of that study, 
the three-dimensional structures of three SH2 domains have 
been determined by the techniques of X-ray crystallography and 
NMR. The secondary structures of these are illustrated at the 
base of Figure 6 (Booker et al., 1992; Overduin et al., 1992;
Waksman et al., 1992). The conservation histograms clearly
correspond to the regions of secondary structure, and are helpful
in identifying patterns characteristic of α-helix and β-strand.
For example, at positions 15 and 97, CXXCCXXC patterns
(where C = conserved) characteristic of α-helix are clearly
visible. 

The annexins are a family of proteins that bind phospholipid 
in a calcium-dependent manner. Annexins consist of a variable 
N-terminal sequence followed by four or eight repeats, each 
of ~80 amino acids. Inspection of a multiple sequence alignment
of 40 repeats identified the unique features of each repeat
family, and located patterns of residue substitution characteristic
of the secondary structures (Barton et al., 1991). Figure 5 illustrates
the application of hierarchical conservation analysis
to a subset of these annexin repeats. Only conserved charges 
are shown (Figure 5a), and the differences summary clearly 
locates the position of a change in charge sign (position 31). 
This charge swap corresponds to the site of an inter-repeat salt 
bridge (Barton et al., 1991). Additional charge changes are also
seen at positions 13, 31, 40 and 68 as listed in the textual summary
shown in Figure 5(b). While all these features can be identified
by hand inspection of the alignment, the process is
laborious and error-prone. The strategy described in this paper 
reduces the scope for error, allows alternative subgroupings 
to be investigated rapidly, and provides shading and boxing that 
is structurally relevant. 

AMAS and Alscript are available from the authors.